Red Wine Quality by Juho Salminen

Introduction

This report explores whether high quality red wines can be identified from their chemical properties. Wine making and buying are often more art than science. Selecting a good wine for dinner from dozens of options is challenging for customers. Production also relies heavily on intuition and tradition. Although a tradition may well be good and correct, developing one is difficult and can take decades. A simpler and more robust way to assess wine quality, and to diagnose problems with it, would be beneficial.

As a first step towards a solution I will explore the wine quality dataset collected by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis (Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236). The dataset consists of measurements of various chemical properties of Portuguese vinho verde red wines and subjective quality scores based on wine expert evaluations on a scale from 0 (very bad) to 10 (very excellent). My goal is to explore whether it is possible to identify good wines based on the chemical measurements alone. To do so, I first plan to explore the distributions of the features and their correlations with the quality scores. If promising features for classification are identified, I’d like to try fitting a few ‘quick and dirty’ classification models to the dataset as a proof of concept for a recommendation system.

More information on the dataset is available at: https://s3.amazonaws.com/udacity-hosted-downloads/ud651/wineQualityInfo.txt

Univariate Plots Section

## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
##   X fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 1           7.4             0.70        0.00            1.9     0.076
## 2 2           7.8             0.88        0.00            2.6     0.098
## 3 3           7.8             0.76        0.04            2.3     0.092
## 4 4          11.2             0.28        0.56            1.9     0.075
## 5 5           7.4             0.70        0.00            1.9     0.076
## 6 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

The dataset consists of 1599 observations in 13 columns: an id column X and 12 variables. Quality of wines is measured by integers, and the observed scores range from 3 to 8. The full scale runs from 0 to 10, but for some reason the extreme values have not been used. The other variables are continuous measurements of the physicochemical properties of the wine.

Quality

## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18
## 
##    3    4    5    6    7    8 
##  0.6  3.3 42.6 39.9 12.4  1.1

The distribution of quality scores is roughly bell-shaped with median 6 and mean 5.636. The left tail is longer (scores go down to 3), but the right tail is heavier (217 wines score 7 or 8, while only 63 score 3 or 4). It might make sense to combine categories, as some of them have only a few observations.
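The counts and percentages above can be reproduced without the data file. The following is a minimal sketch that rebuilds the quality vector from the printed counts; the variable names are my own, and in the original analysis the table was presumably computed directly from wine$quality.

```r
# Rebuild the quality vector from the counts printed above.
quality_counts <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(quality_counts)), times = quality_counts)

# Absolute counts and percentages per quality score.
quality_table <- table(quality)
quality_pct <- round(100 * prop.table(quality_table), 1)
print(quality_table)
print(quality_pct)

# Centre of the distribution.
quality_median <- median(quality)
quality_mean <- mean(quality)
```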

## 
##  Low High 
## 1382  217
## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   5.000   5.409   6.000   6.000 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   7.000   7.000   7.000   7.083   7.000   8.000

It might be easier to work with only two categories of wines instead of the full range of evaluations. Buyers are likely more interested in whether a wine is worth buying than in its exact rating. I therefore split the wines into low quality (scores 3 to 6) and high quality (scores 7 and 8). The mean quality score is 5.4 for low quality wines and 7.1 for high quality wines.
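A minimal sketch of how such a binary variable can be built, assuming the cut-off between 6 and 7 that the summaries above imply. The quality vector is again rebuilt from the printed counts, and ifelse() is only one of several ways to do this (cut() would work as well).

```r
# Quality scores rebuilt from the counts printed earlier in the report.
quality <- rep(3:8, times = c(10, 53, 681, 638, 199, 18))

# Scores of 7 or 8 are "High", everything else is "Low".
quality.bin <- factor(ifelse(quality >= 7, "High", "Low"),
                      levels = c("Low", "High"))

# Group sizes and mean quality score within each group.
bin_counts <- table(quality.bin)
bin_means <- tapply(quality, quality.bin, mean)
print(bin_counts)
print(round(bin_means, 3))
```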

Acidity

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

Fixed acidity of the wines is concentrated around the value 8, with some skew to the right. Most wines have a fixed acidity between 7 and 9.5. It will be interesting to see whether the best wines have the highest acidity. In that case they would be easy to identify.

wine[wine$fixed.acidity > 15, ]
##       X fixed.acidity volatile.acidity citric.acid residual.sugar
## 443 443          15.6            0.685        0.76            3.7
## 555 555          15.5            0.645        0.49            4.2
## 556 556          15.5            0.645        0.49            4.2
## 558 558          15.6            0.645        0.49            4.2
## 653 653          15.9            0.360        0.65            7.5
##     chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 443     0.100                   6                   43 1.00320 2.95
## 555     0.095                  10                   23 1.00315 2.92
## 556     0.095                  10                   23 1.00315 2.92
## 558     0.095                  10                   23 1.00315 2.92
## 653     0.096                  22                   71 0.99760 2.98
##     sulphates alcohol quality quality.bin
## 443      0.68    11.2       7        High
## 555      0.74    11.1       5         Low
## 556      0.74    11.1       5         Low
## 558      0.74    11.1       5         Low
## 653      0.84    14.9       5         Low

There are a few outliers. Curiously, three of them look like repeated measurements of the same wine: rows 555 and 556 are identical, and row 558 differs from them only in fixed acidity.
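One way to check for repeated measurements is duplicated(), ignoring the id column. The sketch below uses only a subset of the columns from the outlier rows printed above, typed in by hand, so it illustrates the check rather than running it on the full dataset.

```r
# Outlier rows from the output above (subset of columns only).
outliers <- data.frame(
  X                = c(443, 555, 556, 558, 653),
  fixed.acidity    = c(15.6, 15.5, 15.5, 15.6, 15.9),
  volatile.acidity = c(0.685, 0.645, 0.645, 0.645, 0.360),
  citric.acid      = c(0.76, 0.49, 0.49, 0.49, 0.65),
  residual.sugar   = c(3.7, 4.2, 4.2, 4.2, 7.5),
  alcohol          = c(11.2, 11.1, 11.1, 11.1, 14.9),
  quality          = c(7, 5, 5, 5, 5)
)

# Compare rows on the measurements only; X is just a row id.
measurements <- outliers[, setdiff(names(outliers), "X")]

# Flag every row that has an exact twin, in either direction.
is_repeat <- duplicated(measurements) | duplicated(measurements, fromLast = TRUE)
repeated_ids <- outliers$X[is_repeat]
print(repeated_ids)
```

Row 558 is not flagged because its fixed acidity differs by 0.1, so an exact-match check catches only the strictly identical pair.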

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1300 1300           7.6             1.58           0            2.1
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density  pH
## 1300     0.137                   5                    9 0.99476 3.5
##      sulphates alcohol quality quality.bin
## 1300       0.4    10.9       3         Low

Volatile acidity is much lower than fixed acidity in absolute terms: mean volatile acidity is 0.528, while the mean for fixed acidity (8.32) is an order of magnitude higher. The distribution appears to have low variance with a few outliers to the right. The biggest outlier is of poor quality. Someone got wine-making really wrong?

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

Increasing the resolution (using narrower bins) reveals an interesting gap in the middle of the distribution. Why is this?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

Many wines have zero or very little citric acid. Otherwise the distribution is quite flat until it starts to decrease around 0.5. There is a curious spike at this value, and a couple of less distinct ones at lower values. It looks as if the wine makers might be aiming for a citric acid level of either zero, 0.25 or 0.5. Maybe these spikes indicate different types of wines?

## 
##    0 0.49 0.24 0.02 0.26  0.1 0.01 0.08 0.21 0.32 0.03 0.09  0.3 0.31 0.04 
##  132   68   51   50   38   35   33   33   33   32   30   30   30   30   29 
##  0.4 0.42 0.39 0.12 0.22 0.25  0.2 0.23 0.33 0.06 0.34 0.44 0.48 0.07 0.18 
##   29   29   28   27   27   27   25   25   25   24   24   23   23   22   22 
## 0.45 0.14 0.19 0.29 0.05 0.27 0.36  0.5 0.15 0.28 0.37 0.46 0.13 0.47 0.52 
##   22   21   21   21   20   20   20   20   19   19   19   19   18   18   17 
## 0.17 0.41 0.11 0.43 0.38 0.53 0.66 0.35 0.51 0.54 0.55 0.68 0.63 0.16 0.57 
##   16   16   15   15   14   14   14   13   13   13   12   11   10    9    9 
## 0.58  0.6 0.64 0.56 0.59 0.65 0.69 0.74 0.73 0.76 0.61 0.67  0.7 0.62 0.71 
##    9    9    9    8    8    7    4    4    3    3    2    2    2    1    1 
## 0.72 0.75 0.78 0.79    1 
##    1    1    1    1    1

The frequency table shows that the spikes are actually located at 0.24 and 0.49 rather than at 0.25 and 0.5.
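Sorting the value counts is a quick way to locate such spikes. The sketch below uses only an excerpt of the counts from the table above (the five most frequent values plus 0.25 and 0.5 for comparison); on the full data the equivalent would presumably be sort(table(wine$citric.acid), decreasing = TRUE).

```r
# Excerpt of the citric acid frequency table printed above.
citric_counts <- c("0" = 132, "0.49" = 68, "0.24" = 51, "0.02" = 50,
                   "0.26" = 38, "0.1" = 35, "0.25" = 27, "0.5" = 20)

# Most frequent values first.
sorted_counts <- sort(citric_counts, decreasing = TRUE)
top3 <- names(sorted_counts)[1:3]
print(top3)

# How much more common are the spike values than their round neighbours?
ratio_024_vs_025 <- citric_counts[["0.24"]] / citric_counts[["0.25"]]
ratio_049_vs_050 <- citric_counts[["0.49"]] / citric_counts[["0.5"]]
```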

Residual sugar

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 34     34           6.9            0.605        0.12           10.7
## 325   325          10.0            0.490        0.20           11.0
## 326   326          10.0            0.490        0.20           11.0
## 481   481          10.6            0.280        0.39           15.5
## 1236 1236           6.0            0.330        0.32           12.9
## 1245 1245           5.9            0.290        0.25           13.4
## 1435 1435          10.2            0.540        0.37           15.4
## 1436 1436          10.2            0.540        0.37           15.4
## 1475 1475           9.9            0.500        0.50           13.8
## 1477 1477           9.9            0.500        0.50           13.8
## 1575 1575           5.6            0.310        0.78           13.9
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 34       0.073                  40                   83 0.99930 3.45
## 325      0.071                  13                   50 1.00150 3.16
## 326      0.071                  13                   50 1.00150 3.16
## 481      0.069                   6                   23 1.00260 3.12
## 1236     0.054                   6                  113 0.99572 3.30
## 1245     0.067                  72                  160 0.99721 3.33
## 1435     0.214                  55                   95 1.00369 3.18
## 1436     0.214                  55                   95 1.00369 3.18
## 1475     0.205                  48                   82 1.00242 3.16
## 1477     0.205                  48                   82 1.00242 3.16
## 1575     0.074                  23                   92 0.99677 3.39
##      sulphates alcohol quality quality.bin
## 34        0.52     9.4       6         Low
## 325       0.69     9.2       6         Low
## 326       0.69     9.2       6         Low
## 481       0.66     9.2       5         Low
## 1236      0.56    11.5       4         Low
## 1245      0.54    10.3       6         Low
## 1435      0.77     9.0       6         Low
## 1436      0.77     9.0       6         Low
## 1475      0.75     8.8       5         Low
## 1477      0.75     8.8       5         Low
## 1575      0.48    10.5       6         Low

Most wines have a low amount of residual sugar, between about 1 and 3. Some wines have much more: the highest outlier has about 6 times as much residual sugar as the average wine (15.5 versus a mean of 2.539). Perhaps they are of a different type, like dessert wines? None of the outliers are high quality wines.

Chlorides

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 18     18           8.1            0.560        0.28            1.7
## 20     20           7.9            0.320        0.51            1.8
## 43     43           7.5            0.490        0.20            2.6
## 82     82           7.8            0.430        0.70            1.9
## 84     84           7.3            0.670        0.26            1.8
## 107   107           7.8            0.410        0.68            1.7
## 152   152           9.2            0.520        1.00            3.4
## 170   170           7.5            0.705        0.24            1.8
## 227   227           8.9            0.590        0.50            2.0
## 259   259           7.7            0.410        0.76            1.8
## 282   282           7.7            0.270        0.68            3.5
## 292   292          11.0            0.200        0.48            2.0
## 452   452           8.4            0.370        0.53            1.8
## 693   693           8.6            0.490        0.51            2.0
## 731   731           9.5            0.550        0.66            2.3
## 755   755           7.8            0.480        0.68            1.7
## 1052 1052           8.5            0.460        0.59            1.4
## 1166 1166           8.5            0.440        0.50            1.9
## 1261 1261           8.6            0.635        0.68            1.8
## 1320 1320           9.1            0.760        0.68            1.7
## 1371 1371           8.7            0.780        0.51            1.7
## 1373 1373           8.7            0.780        0.51            1.7
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 18       0.368                  16                   56 0.99680 3.11
## 20       0.341                  17                   56 0.99690 3.04
## 43       0.332                   8                   14 0.99680 3.21
## 82       0.464                  22                   67 0.99740 3.13
## 84       0.401                  16                   51 0.99690 3.16
## 107      0.467                  18                   69 0.99730 3.08
## 152      0.610                  32                   69 0.99960 2.74
## 170      0.360                  15                   63 0.99640 3.00
## 227      0.337                  27                   81 0.99640 3.04
## 259      0.611                   8                   45 0.99680 3.06
## 282      0.358                   5                   10 0.99720 3.25
## 292      0.343                   6                   18 0.99790 3.30
## 452      0.413                   9                   26 0.99790 3.06
## 693      0.422                  16                   62 0.99790 3.03
## 731      0.387                  12                   37 0.99820 3.17
## 755      0.415                  14                   32 0.99656 3.09
## 1052     0.414                  16                   45 0.99702 3.03
## 1166     0.369                  15                   38 0.99634 3.01
## 1261     0.403                  19                   56 0.99632 3.02
## 1320     0.414                  18                   64 0.99652 2.90
## 1371     0.415                  12                   66 0.99623 3.00
## 1373     0.415                  12                   66 0.99623 3.00
##      sulphates alcohol quality quality.bin
## 18        1.28     9.3       5         Low
## 20        1.08     9.2       6         Low
## 43        0.90    10.5       6         Low
## 82        1.28     9.4       5         Low
## 84        1.14     9.4       5         Low
## 107       1.31     9.3       5         Low
## 152       2.00     9.4       4         Low
## 170       1.59     9.5       5         Low
## 227       1.61     9.5       6         Low
## 259       1.26     9.4       5         Low
## 282       1.08     9.9       7        High
## 292       0.71    10.5       5         Low
## 452       1.06     9.1       6         Low
## 693       1.17     9.0       5         Low
## 731       0.67     9.6       5         Low
## 755       1.06     9.1       6         Low
## 1052      1.34     9.2       5         Low
## 1166      1.10     9.4       5         Low
## 1261      1.15     9.3       5         Low
## 1320      1.33     9.1       6         Low
## 1371      1.17     9.2       5         Low
## 1373      1.17     9.2       5         Low

The distribution of chlorides resembles that of residual sugar. Most values are tightly concentrated around 0.08, with a thin and long right tail all the way to 0.6. It looks as if there is a small concentration of wines around 0.4. Is this a distinct subtype or category of wines, or just an artefact in the data? With one exception the outliers are low quality wines. They are all dry (low residual sugar) and low on alcohol.

Sulfur dioxide

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

Most wines have fairly low amounts of free sulfur dioxide. The distribution is again right-skewed. In absolute terms the differences are large, from 1 mg/l to 72 mg/l (the dataset description gives sulfur dioxide in mg/dm^3).

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

Same story here as with free sulfur dioxide, but with values roughly three times higher (mean 46.47 versus 15.87). I wonder what the relationship is between the free and total amounts of sulfur dioxide?

There is a slightly increasing trend in additional sulfur dioxide as the amount of free sulfur dioxide increases. It is still quite common for free sulfur dioxide to account for most of the total. I’m creating a new variable fixed.sulfur.dioxide by calculating the difference between the two measures.
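The derived variable is a simple column difference. A minimal sketch on a handful of rows taken from this report (the first three wines and rows 1080 and 1082); the free.share column is my own addition for illustration and is not part of the original analysis.

```r
# A few rows copied from the printed output of this report.
so2 <- data.frame(
  X                    = c(1, 2, 3, 1080, 1082),
  free.sulfur.dioxide  = c(11, 25, 15, 37.5, 37.5),
  total.sulfur.dioxide = c(34, 67, 54, 278, 289)
)

# Fixed sulfur dioxide is whatever part of the total is not free.
so2$fixed.sulfur.dioxide <- so2$total.sulfur.dioxide - so2$free.sulfur.dioxide

# Hypothetical helper column: share of the total that is free.
so2$free.share <- so2$free.sulfur.dioxide / so2$total.sulfur.dioxide
print(so2)
```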

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   12.00   21.00   30.59   39.00  251.50
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1080 1080           7.9              0.3        0.68            8.3
## 1082 1082           7.9              0.3        0.68            8.3
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1080      0.05                37.5                  278 0.99316 3.01
## 1082      0.05                37.5                  289 0.99316 3.01
##      sulphates alcohol quality quality.bin fixed.sulfur.dioxide
## 1080      0.51    12.3       7        High                240.5
## 1082      0.51    12.3       7        High                251.5

The distribution is similar to that of total sulfur dioxide. There are two extreme outliers, both of them high quality. The features of these two wines are curiously similar: the only difference is in the amount of total sulfur dioxide. Duplicate?

Density

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

Density is almost normally distributed around a value a little less than 1, which makes sense as wine is mostly water, and alcohol is less dense than water. Density might actually be correlated with the amount of alcohol.

There indeed is a downward trend with increasing alcohol levels. Stronger wines tend to be less dense.

pH

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 46     46           4.6             0.52        0.15            2.1
## 96     96           4.7             0.60        0.17            2.3
## 152   152           9.2             0.52        1.00            3.4
## 696   696           5.1             0.47        0.02            1.3
## 1317 1317           5.4             0.74        0.00            1.2
## 1322 1322           5.0             0.74        0.00            1.2
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 46       0.054                   8                   65 0.99340 3.90
## 96       0.058                  17                  106 0.99320 3.85
## 152      0.610                  32                   69 0.99960 2.74
## 696      0.034                  18                   44 0.99210 3.90
## 1317     0.041                  16                   46 0.99258 4.01
## 1322     0.041                  16                   46 0.99258 4.01
##      sulphates alcohol quality quality.bin fixed.sulfur.dioxide
## 46        0.56    13.1       4         Low                   57
## 96        0.60    12.9       6         Low                   89
## 152       2.00     9.4       4         Low                   37
## 696       0.62    12.8       6         Low                   26
## 1317      0.59    12.5       6         Low                   30
## 1322      0.59    12.5       6         Low                   30

pH of the wines is almost normally distributed around the mean pH 3.3. Wines are acidic. The few minor outliers are not interesting.

Sulphates

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000
##       X fixed.acidity volatile.acidity citric.acid residual.sugar
## 14   14           7.8            0.610        0.29            1.6
## 87   87           8.6            0.490        0.28            1.9
## 92   92           8.6            0.490        0.28            1.9
## 93   93           8.6            0.490        0.29            2.0
## 152 152           9.2            0.520        1.00            3.4
## 170 170           7.5            0.705        0.24            1.8
## 227 227           8.9            0.590        0.50            2.0
## 724 724           7.1            0.310        0.30            2.2
##     chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 14      0.114                   9                   29  0.9974 3.26
## 87      0.110                  20                  136  0.9972 2.93
## 92      0.110                  20                  136  0.9972 2.93
## 93      0.110                  19                  133  0.9972 2.93
## 152     0.610                  32                   69  0.9996 2.74
## 170     0.360                  15                   63  0.9964 3.00
## 227     0.337                  27                   81  0.9964 3.04
## 724     0.053                  36                  127  0.9965 2.94
##     sulphates alcohol quality quality.bin fixed.sulfur.dioxide
## 14       1.56     9.1       5         Low                   20
## 87       1.95     9.9       6         Low                  116
## 92       1.95     9.9       6         Low                  116
## 93       1.98     9.8       5         Low                  114
## 152      2.00     9.4       4         Low                   37
## 170      1.59     9.5       5         Low                   48
## 227      1.61     9.5       6         Low                   54
## 724      1.62     9.5       5         Low                   91

A relatively tight distribution with some skew and outliers to the right. Typical wines have sulphates between 0.5 and 0.8. All outliers are low quality wines low on residual sugar and alcohol.

Alcohol

## Warning in loop_apply(n, do.ply): position_stack requires constant width:
## output may be incorrect

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90
## 
##   9.5   9.4   9.8   9.2    10  10.5   9.3   9.6    11   9.7   9.9  10.9 
##   139   103    78    72    67    67    59    59    59    54    49    49 
##  10.1  10.2  10.8  10.4  11.2  10.3  11.3  11.4     9  11.5  11.8  10.6 
##    47    46    42    41    36    33    32    32    30    30    29    28 
##  10.7  11.1   9.1  11.7    12  12.5  11.9  12.8  11.6  12.1  12.4  12.2 
##    27    27    23    23    21    21    20    17    15    13    13    12 
##  12.3  12.7  12.9    14  12.6    13  13.6  13.3  13.4   8.4   8.7   8.8 
##    12     9     9     7     6     6     4     3     3     2     2     2 
##  9.55 10.03 10.55  13.1   8.5  9.05  9.23  9.25  9.57  9.95 10.75 11.07 
##     2     2     2     2     1     1     1     1     1     1     1     1 
## 11.95  13.2  13.5 13.57  14.9 
##     1     1     1     1     1

Wines typically have at least 9 % alcohol, with around 10 % being the average, and the number of wines slowly decreases as the alcohol content increases. Wine makers seem to prefer round numbers in alcohol content. There are spikes in the distribution around every .0 and .5.
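The preference for round numbers can be quantified from the frequency table above. The sketch below transcribes that table by hand and computes the share of wines whose alcohol content is a whole or half percent; the definition of "round" and all variable names are my own assumptions.

```r
# Distinct alcohol values and their counts, transcribed from the table above.
alcohol_values <- c(
  9.5, 9.4, 9.8, 9.2, 10, 10.5, 9.3, 9.6, 11, 9.7, 9.9, 10.9,
  10.1, 10.2, 10.8, 10.4, 11.2, 10.3, 11.3, 11.4, 9, 11.5, 11.8, 10.6,
  10.7, 11.1, 9.1, 11.7, 12, 12.5, 11.9, 12.8, 11.6, 12.1, 12.4, 12.2,
  12.3, 12.7, 12.9, 14, 12.6, 13, 13.6, 13.3, 13.4, 8.4, 8.7, 8.8,
  9.55, 10.03, 10.55, 13.1, 8.5, 9.05, 9.23, 9.25, 9.57, 9.95, 10.75, 11.07,
  11.95, 13.2, 13.5, 13.57, 14.9
)
alcohol_counts <- c(
  139, 103, 78, 72, 67, 67, 59, 59, 59, 54, 49, 49,
  47, 46, 42, 41, 36, 33, 32, 32, 30, 30, 29, 28,
  27, 27, 23, 23, 21, 21, 20, 17, 15, 13, 13, 12,
  12, 9, 9, 7, 6, 6, 4, 3, 3, 2, 2, 2,
  2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1,
  1, 1, 1, 1, 1
)

# One entry per wine.
alcohol <- rep(alcohol_values, times = alcohol_counts)

# "Round" here means a whole or half percent (x.0 or x.5):
# doubling such a value gives an integer.
is_round <- (alcohol * 2) %% 1 == 0
round_count <- sum(is_round)
round_share <- mean(is_round)
print(round(100 * round_share, 1))
```

For comparison, if values recorded to one decimal were spread evenly over the ten possible final digits, about 20 % of them would end in .0 or .5.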

Univariate Analysis

What is the structure of your dataset?

The red wine dataset consists of 1599 observations of 12 variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, and quality). Quality is an ordered categorical variable with observed values from 3 to 8, larger values being better. The other variables are continuous.

Most wines (82.5 %) have a quality rating of 5 or 6. The third most common rating is 7 (12.4 %), while all the other quality scores together cover only 5 % of the wines. Red wine is acidic (pH 2.7-4.0) and usually has only a little residual sugar. Mean alcohol content of the wines is 10.4 %.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest is quality. I’d like to be able to classify wines into high (quality 7 or 8) and low (quality 6 or lower) categories based on some combination of the physicochemical measurements.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Based on the shapes of the distributions, volatile acidity, citric acid, chlorides and alcohol seem promising. In particular, the alcohol and citric acid distributions feature curious spikes at round values. This suggests the wine makers might be aiming for specific values of these features, which in turn implies they believe those features have something to do with the quality of the wine.

Did you create any new variables from existing variables in the dataset?

I created the variable fixed.sulfur.dioxide by subtracting free sulfur dioxide from total sulfur dioxide. I also combined the quality categories into a new binary variable quality.bin. In this variable ‘High’ is assigned to wines with quality 7 or 8 and ‘Low’ is assigned to all other wines.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Several of the distributions were right-skewed. I tried a few transformations (logarithm, cube root, power) on some of them, but the shapes of the distributions did not improve. In the end I used the features as they are. (With hindsight, the classification models could have benefitted from normalization.)
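For reference, the kinds of transformation and normalization mentioned above could look like the following. This is only a sketch on the first six rows of three features from the head() output; it was not part of the original analysis, and the choice of columns is arbitrary.

```r
# First six wines, three features, copied from the head() output.
features <- data.frame(
  fixed.acidity        = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4),
  residual.sugar       = c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8),
  total.sulfur.dioxide = c(34, 67, 54, 60, 34, 40)
)

# Candidate transformations for a right-skewed feature.
log_so2  <- log10(features$total.sulfur.dioxide)
cbrt_so2 <- features$total.sulfur.dioxide^(1 / 3)

# Normalization: centre each column to mean 0 and scale to standard deviation 1,
# so that features measured on very different scales become comparable.
scaled <- scale(features)
print(round(scaled, 2))
```

On the full dataset the same scale() call would be applied to all numeric predictors before fitting a distance-based or regularized classifier.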

Bivariate Plots Section

Correlations and summaries by quality

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
## fixed.sulfur.dioxide   -0.07814929      0.097033939  0.06677604
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
## fixed.sulfur.dioxide    0.174529035  0.055479649         0.425148917
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
## fixed.sulfur.dioxide           0.95768634  0.09513464 -0.10805328
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000
## fixed.sulfur.dioxide  0.032244043 -0.22320257 -0.20546298
##                      fixed.sulfur.dioxide
## fixed.acidity                 -0.07814929
## volatile.acidity               0.09703394
## citric.acid                    0.06677604
## residual.sugar                 0.17452903
## chlorides                      0.05547965
## free.sulfur.dioxide            0.42514892
## total.sulfur.dioxide           0.95768634
## density                        0.09513464
## pH                            -0.10805328
## sulphates                      0.03224404
## alcohol                       -0.22320257
## quality                       -0.20546298
## fixed.sulfur.dioxide           1.00000000
## wine[, 14]: Low
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 4.600   Min.   :0.160    Min.   :0.0000   Min.   : 0.900  
##  1st Qu.: 7.100   1st Qu.:0.420    1st Qu.:0.0825   1st Qu.: 1.900  
##  Median : 7.800   Median :0.540    Median :0.2400   Median : 2.200  
##  Mean   : 8.237   Mean   :0.547    Mean   :0.2544   Mean   : 2.512  
##  3rd Qu.: 9.100   3rd Qu.:0.650    3rd Qu.:0.4000   3rd Qu.: 2.600  
##  Max.   :15.900   Max.   :1.580    Max.   :1.0000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.03400   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07100   1st Qu.: 8.00       1st Qu.: 23.00      
##  Median :0.08000   Median :14.00       Median : 39.50      
##  Mean   :0.08928   Mean   :16.17       Mean   : 48.29      
##  3rd Qu.:0.09100   3rd Qu.:22.00       3rd Qu.: 65.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :165.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9958   1st Qu.:3.210   1st Qu.:0.5400   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6000   Median :10.00  
##  Mean   :0.9969   Mean   :3.315   Mean   :0.6448   Mean   :10.25  
##  3rd Qu.:0.9979   3rd Qu.:3.410   3rd Qu.:0.7000   3rd Qu.:10.90  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##  fixed.sulfur.dioxide
##  Min.   :  3.00      
##  1st Qu.: 12.00      
##  Median : 23.00      
##  Mean   : 32.11      
##  3rd Qu.: 42.00      
##  Max.   :128.00      
## -------------------------------------------------------- 
## wine[, 14]: High
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar 
##  Min.   : 4.900   Min.   :0.1200   Min.   :0.0000   Min.   :1.200  
##  1st Qu.: 7.400   1st Qu.:0.3000   1st Qu.:0.3000   1st Qu.:2.000  
##  Median : 8.700   Median :0.3700   Median :0.4000   Median :2.300  
##  Mean   : 8.847   Mean   :0.4055   Mean   :0.3765   Mean   :2.709  
##  3rd Qu.:10.100   3rd Qu.:0.4900   3rd Qu.:0.4900   3rd Qu.:2.700  
##  Max.   :15.600   Max.   :0.9150   Max.   :0.7600   Max.   :8.900  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 3.00       Min.   :  7.00      
##  1st Qu.:0.06200   1st Qu.: 6.00       1st Qu.: 17.00      
##  Median :0.07300   Median :11.00       Median : 27.00      
##  Mean   :0.07591   Mean   :13.98       Mean   : 34.89      
##  3rd Qu.:0.08500   3rd Qu.:18.00       3rd Qu.: 43.00      
##  Max.   :0.35800   Max.   :54.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9906   Min.   :2.880   Min.   :0.3900   Min.   : 9.20  
##  1st Qu.:0.9947   1st Qu.:3.200   1st Qu.:0.6500   1st Qu.:10.80  
##  Median :0.9957   Median :3.270   Median :0.7400   Median :11.60  
##  Mean   :0.9960   Mean   :3.289   Mean   :0.7435   Mean   :11.52  
##  3rd Qu.:0.9973   3rd Qu.:3.380   3rd Qu.:0.8200   3rd Qu.:12.20  
##  Max.   :1.0032   Max.   :3.780   Max.   :1.3600   Max.   :14.00  
##  fixed.sulfur.dioxide
##  Min.   :  4.00      
##  1st Qu.:  9.00      
##  Median : 14.00      
##  Mean   : 20.91      
##  3rd Qu.: 22.00      
##  Max.   :251.50

Volatile acidity, citric acid, sulphates and alcohol have moderate correlations with wine quality. These features also have noticeably different means between low and high quality wines.
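
The quality column of the correlation matrix above can be ranked by absolute value to make this comparison explicit. The sketch below re-enters the printed coefficients by hand, so it runs without the dataset; the vector name `quality_cor` is my own and not part of the original analysis.

```r
# Correlations with quality, copied from the printed correlation matrix.
quality_cor <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632,
  fixed.sulfur.dioxide = -0.20546298
)

# Rank features by the strength of the association, ignoring the sign.
ranked <- quality_cor[order(abs(quality_cor), decreasing = TRUE)]
print(round(ranked, 2))
```

Alcohol, volatile acidity, sulphates and citric acid come out on top, with fixed sulfur dioxide close behind citric acid.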

Quality by fixed acidity

No clear trends here. Poor and good wines seem to have higher fixed acidity, but on the other hand there are only a few data points on them, so the effect does not feel very trustworthy.

## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   7.100   7.800   8.237   9.100  15.900 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.700   8.847  10.100  15.600

Comparing only two quality categories reveals that high quality wines actually tend to have higher fixed acidity. The difference between the median values is especially noticeable. Combining quality categories is starting to look like a good idea.
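
The code that builds `quality.bin` is not shown in this report. A minimal sketch of one way to do it, assuming the cut-off used later in the text (7 or 8 is High, 6 or below is Low) and using made-up scores rather than the real data:

```r
# Made-up quality scores for illustration only (not from the dataset).
quality <- c(3, 4, 5, 6, 7, 8, 5, 6, 7)

# Two-level factor: scores of 7 and above are "High", everything else "Low".
quality.bin <- factor(ifelse(quality >= 7, "High", "Low"),
                      levels = c("Low", "High"))

counts <- table(quality.bin)
print(counts)
```

Fixing the level order to Low, High keeps the groups in the same order as in the `by()` summaries shown in this report.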

Quality by volatile acidity

## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.160   0.420   0.540   0.547   0.650   1.580 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4055  0.4900  0.9150

There is a clear decreasing trend with volatile acidity when the wine quality increases. High quality wines have lower values in all quartiles.

With increasing quality the distribution of volatile acidity moves to the left and gets narrower.

The lower the volatile acidity, the more likely the wine is to be of high quality. Looks like about 0.38 volatile acidity is the sweet spot for red wines.

Quality by citric acid

## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0825  0.2400  0.2544  0.4000  1.0000 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3000  0.4000  0.3765  0.4900  0.7600

The very best wines tend to have higher amounts of citric acid.

Interesting! On average good wines tend to have a lot of citric acid, but the density plot reveals the picture is more complex. There seem to be three kinds of wines regarding citric acid: low (close to 0), medium (~0.25) and high (~0.4) amounts of citric acid. Good wines have either a little or a lot of citric acid, while other wines can have any amount of it.

Quality by residual sugar

Lots of outliers..

ggplot(wine, aes(as.factor(quality), residual.sugar)) + 
  geom_boxplot() +
  ylim(0, 4)
## Warning in loop_apply(n, do.ply): Removed 125 rows containing non-finite
## values (stat_boxplot).

ggplot(wine, aes(quality.bin, residual.sugar)) + 
  geom_boxplot() + 
  ylim(0, 4)
## Warning in loop_apply(n, do.ply): Removed 125 rows containing non-finite
## values (stat_boxplot).

ggplot(wine, aes(residual.sugar, color = quality.bin)) + 
  geom_density()

by(wine$residual.sugar, wine$quality.bin, summary)
## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.512   2.600  15.500 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.709   2.700   8.900

Nothing interesting going on here.
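
One caveat about the clipped boxplots: the warnings show that `ylim(0, 4)` removed 125 rows before the boxplot statistics were computed, so the boxes describe only the remaining wines. The sketch below illustrates the effect in base R with made-up values (not taken from the dataset).

```r
# Made-up residual sugar values with two large outliers.
sugar <- c(1.8, 2.0, 2.1, 2.2, 2.4, 2.6, 3.0, 3.9, 8.3, 15.5)

# Mimic what axis limits of 0 to 4 do: values outside the limits are dropped
# before any statistics are computed.
clipped <- sugar[sugar >= 0 & sugar <= 4]

probs <- c(0.25, 0.5, 0.75)
full_stats    <- quantile(sugar, probs)
clipped_stats <- quantile(clipped, probs)
n_removed     <- length(sugar) - length(clipped)

print(rbind(full = full_stats, clipped = clipped_stats))
```

In ggplot2, `coord_cartesian(ylim = c(0, 4))` zooms the view without discarding observations, which keeps the quartiles of the full data.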

Quality by chlorides

Again many outliers.

ggplot(wine, aes(as.factor(quality), chlorides)) + 
  geom_boxplot() +
  ylim(0, 0.2)
## Warning in loop_apply(n, do.ply): Removed 41 rows containing non-finite
## values (stat_boxplot).

ggplot(wine, aes(quality.bin, chlorides)) + 
  geom_boxplot() +
  ylim(0, 0.2)
## Warning in loop_apply(n, do.ply): Removed 41 rows containing non-finite
## values (stat_boxplot).

ggplot(wine, aes(chlorides, color = quality.bin)) + 
  geom_density()

by(wine$chlorides, wine$quality.bin, summary)
## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.07100 0.08000 0.08928 0.09100 0.61100 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07591 0.08500 0.35800

Good wines appear to have slightly lower amounts of chlorides overall: 0.076 vs 0.089 on average. However, there is a lot of overlap between the distributions.

Quality by sulfur dioxide

Quite many outliers.

ggplot(wine, aes(as.factor(quality), free.sulfur.dioxide)) + 
  geom_boxplot() + 
  ylim(0, 30)
## Warning in loop_apply(n, do.ply): Removed 163 rows containing non-finite
## values (stat_boxplot).

ggplot(wine, aes(quality.bin, free.sulfur.dioxide)) + 
  geom_boxplot() +
  ylim(0, 30)
## Warning in loop_apply(n, do.ply): Removed 163 rows containing non-finite
## values (stat_boxplot).

ggplot(wine, aes(free.sulfur.dioxide, color = quality.bin)) + 
  geom_density()

by(wine$free.sulfur.dioxide, wine$quality, summary)
## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.0     6.0    11.0    14.5    34.0 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   12.26   15.00   41.00 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   15.00   16.98   23.00   68.00 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   14.00   15.71   21.00   72.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   14.05   18.00   54.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00    7.50   13.28   16.50   42.00

Average quality wines seem to have a little more free sulfur dioxide on average, but this does not help much in differentiating high quality wines from others. Poor wines (quality score 4) and the best wines (quality score 8) have about the same amount of free sulfur dioxide on average.

Outliers.

ggplot(wine, aes(as.factor(quality), total.sulfur.dioxide)) + 
  geom_boxplot() + 
  ylim(0, 120)
## Warning in loop_apply(n, do.ply): Removed 62 rows containing non-finite
## values (stat_boxplot).

ggplot(wine, aes(quality.bin, total.sulfur.dioxide)) + 
  geom_boxplot() + 
  ylim(0, 120)
## Warning in loop_apply(n, do.ply): Removed 62 rows containing non-finite
## values (stat_boxplot).

ggplot(wine, aes(total.sulfur.dioxide, color = quality.bin)) + 
  geom_density() +
  xlim(0, 120)
## Warning in loop_apply(n, do.ply): Removed 60 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 2 rows containing non-finite
## values (stat_density).

by(wine$total.sulfur.dioxide, wine$quality.bin, summary)
## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   39.50   48.29   65.00  165.00 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.00   27.00   34.89   43.00  289.00

Total sulfur dioxide is a better indicator of whether a wine is good or bad. Good wines have almost 30 % less total sulfur dioxide than poor wines, on average.
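
The size of the gap can be checked directly from the summary above. The helper name `relative_drop` is mine; the inputs are the printed means and medians.

```r
# Relative difference of the High group with respect to the Low group.
relative_drop <- function(low, high) (low - high) / low

mean_drop   <- relative_drop(low = 48.29, high = 34.89)
median_drop <- relative_drop(low = 39.50, high = 27.00)

print(round(100 * c(mean = mean_drop, median = median_drop), 1))
```

The mean is about 28 % lower and the median about 32 % lower in high quality wines.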

Total sulfur dioxide by free sulfur dioxide

High quality wines seem to lie along a line where the amount of total sulfur dioxide is low compared to free sulfur dioxide.

The pattern is not very clear, though. There is a lot of overlap between the distributions.

Getting better…

## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   12.00   23.00   32.11   42.00  128.00 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00    9.00   14.00   20.91   22.00  251.50
##         X fixed.acidity volatile.acidity citric.acid residual.sugar
## 1080 1080           7.9              0.3        0.68            8.3
## 1082 1082           7.9              0.3        0.68            8.3
##      chlorides free.sulfur.dioxide total.sulfur.dioxide density   pH
## 1080      0.05                37.5                  278 0.99316 3.01
## 1082      0.05                37.5                  289 0.99316 3.01
##      sulphates alcohol quality quality.bin fixed.sulfur.dioxide
## 1080      0.51    12.3       7        High                240.5
## 1082      0.51    12.3       7        High                251.5

Fixed sulfur dioxide is an even better discriminator than total sulfur dioxide! Good wines have low amounts of fixed sulfur dioxide. Compared with low quality wines, the quartiles, mean and median are roughly 25 to 50 % lower. There are two extreme outliers that seem to be related or duplicated wines. The only difference between them is in the amount of total sulfur dioxide.
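
The two outlier rows are consistent with fixed sulfur dioxide being defined as total minus free sulfur dioxide. The sketch below assumes that definition (the code that creates the variable is not shown in this section) and reproduces the printed values for rows 1080 and 1082.

```r
# Free and total sulfur dioxide for the two outlier rows printed above.
outliers <- data.frame(
  free.sulfur.dioxide  = c(37.5, 37.5),
  total.sulfur.dioxide = c(278, 289),
  row.names = c("1080", "1082")
)

# Assumed definition of the derived feature: total minus free.
outliers$fixed.sulfur.dioxide <-
  outliers$total.sulfur.dioxide - outliers$free.sulfur.dioxide

print(outliers)
```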

Looks like a low amount of fixed sulfur dioxide is a prerequisite but not a guarantee for high wine quality.

Quality by density

## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9958  0.9968  0.9969  0.9979  1.0040 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9947  0.9957  0.9960  0.9974  1.0030

Higher quality wines seem to have lower density. They also have more alcohol, which could cause the correlation. It is probably a good idea to explore how different things affect the density of the wine.

Quality by pH

## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.315   3.410   4.010 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.270   3.289   3.380   3.780

Better wines tend to have slightly lower pH, perhaps because better wines often have higher fixed acidity. The pattern is not very clear though.

Quality by sulphates

Outliers ruining a plot again.

ggplot(wine, aes(as.factor(quality), sulphates)) + 
  geom_boxplot() + 
  ylim(0, 1)
## Warning in loop_apply(n, do.ply): Removed 58 rows containing non-finite
## values (stat_boxplot).

ggplot(wine, aes(quality.bin, sulphates)) + 
  geom_boxplot() + 
  ylim(0, 1)
## Warning in loop_apply(n, do.ply): Removed 58 rows containing non-finite
## values (stat_boxplot).

ggplot(wine, aes(sulphates, color = as.factor(quality))) + 
  geom_density() + 
  xlim(0, 1.5)
## Warning in loop_apply(n, do.ply): Removed 1 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 4 rows containing non-finite
## values (stat_density).
## Warning in loop_apply(n, do.ply): Removed 3 rows containing non-finite
## values (stat_density).

ggplot(wine, aes(sulphates, color = quality.bin)) + 
  geom_density() + 
  xlim(0, 1.5)
## Warning in loop_apply(n, do.ply): Removed 8 rows containing non-finite
## values (stat_density).

by(wine$sulphates, wine$quality.bin, summary)
## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5400  0.6000  0.6448  0.7000  2.0000 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7435  0.8200  1.3600

Higher amounts of sulphates are associated with higher quality, but there are many outliers in average quality wines that muddy the relationship.

Quality by alcohol

The pattern with alcohol is a little bit U-shaped. The worst quality wines tend to have more alcohol than average wines, and then the better than average wines have increasing amounts of alcohol.

With lower resolution the pattern becomes clearer. Better wines tend to have higher amounts of alcohol.

## wine$quality.bin: Low
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90 
## -------------------------------------------------------- 
## wine$quality.bin: High
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00

Wines with more than 12 % alcohol are likely to have high quality, and wines with less than 10 % alcohol are likely poor.

Composition of density

Many of the features are associated with density, which makes sense. Fixed acidity and alcohol seem to have the strongest association. pH has strong correlations with acidity measures, so its association with density is likely to be a result of that.

Composition of pH and acidity

The higher the acidity, the lower the pH. Surprisingly, higher volatile acidity has a weak correlation with higher pH. Maybe volatile acids are “escaping” from the wine?

Fixed and volatile acidity do not have much to do with each other, but citric acid has to do with both of them! Citric acid has positive correlation with fixed acidity and negative correlation with volatile acidity. These relationships look somewhat nonlinear.

Looks like fixed acidity is linearly related to citric acid to some power, perhaps 4.

Here the relationship looks most linear when citric acid values are squared, but the approximation is very rough.
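
A simple way to compare candidate exponents is to correlate the response with several powers of the predictor and see which one gives the most linear relationship. The sketch below uses noise-free synthetic data, not the wine data, so the answer is known in advance; all names in it are illustrative.

```r
# Synthetic example: y is an exact linear function of x squared.
x <- seq(0.05, 1, by = 0.05)
y <- 1.2 - 0.8 * x^2

# Absolute Pearson correlation between y and x raised to each candidate power.
powers  <- 1:4
abs_cor <- sapply(powers, function(p) abs(cor(y, x^p)))
names(abs_cor) <- paste0("x^", powers)

best_power <- powers[which.max(abs_cor)]
print(round(abs_cor, 3))
```

On real data the differences between neighbouring powers are usually small, so a result like this is a rough guide rather than an estimate of the true exponent.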

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Merging quality categories to just high (7 or 8) and low (6 or below) turned out to be helpful in clarifying the differences between wines. In summary, high quality wines tend to have relatively:

  • low volatile acidity (around 0.4)
  • either low or high amounts of citric acid (around 0.1 or 0.5)
  • low amounts of fixed sulfur dioxide
  • often highish amounts of sulphates
  • on average 11.5 % alcohol compared to 10.25 % in lower quality wines

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

In addition to relationships between physical measures and quality I investigated the composition of acidity in more detail, because two acidity measures correlated with quality, and they interact with each other. Interestingly, volatile acidity has a negative correlation and fixed acidity has a positive correlation with citric acid, but volatile and fixed acidity do not correlate much with each other. The relationships seem to be linear in some power of citric acid, perhaps around 2 (volatile acidity) and 4 (fixed acidity). I also looked at the composition of density, which is likely a result of other measured physical properties.

What was the strongest relationship you found?

The strongest correlation I found was between total sulfur dioxide and fixed sulfur dioxide, but the correlation is a result of how the variable was created. After that fixed acidity and pH have the highest correlation (-0.68). Other similarly strong correlations include:

  • fixed acidity and citric acid
  • fixed acidity and density
  • free sulfur dioxide and total sulfur dioxide

However, the most interesting correlation is between alcohol and quality (0.48). High quality wines tend to have a lot of alcohol.
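
Ranking the off-diagonal entries of a correlation matrix is an easy way to find the strongest pairs. The sketch below does this for a four-variable subset whose coefficients are copied from the printed matrix; the helper `set_pair` and the object names are mine.

```r
vars <- c("fixed.acidity", "density", "pH", "alcohol")
cm <- matrix(1, nrow = 4, ncol = 4, dimnames = list(vars, vars))

# Fill both triangles so the matrix stays symmetric.
set_pair <- function(m, a, b, r) { m[a, b] <- r; m[b, a] <- r; m }
cm <- set_pair(cm, "fixed.acidity", "density",  0.66804729)
cm <- set_pair(cm, "fixed.acidity", "pH",      -0.68297819)
cm <- set_pair(cm, "fixed.acidity", "alcohol", -0.06166827)
cm <- set_pair(cm, "density",       "pH",      -0.34169933)
cm <- set_pair(cm, "density",       "alcohol", -0.49617977)
cm <- set_pair(cm, "pH",            "alcohol",  0.20563251)

# One row per variable pair, taken from the upper triangle.
idx <- which(upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r    = cm[upper.tri(cm)],
  stringsAsFactors = FALSE
)
pairs <- pairs[order(abs(pairs$r), decreasing = TRUE), ]
print(pairs, row.names = FALSE)
```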

Multivariate Plots Section

Acidity vs. quality

## wine$quality.bin: Low
##                  fixed.acidity volatile.acidity
## fixed.acidity        1.0000000       -0.2313619
## volatile.acidity    -0.2313619        1.0000000
## -------------------------------------------------------- 
## wine$quality.bin: High
##                  fixed.acidity volatile.acidity
## fixed.acidity        1.0000000       -0.2651239
## volatile.acidity    -0.2651239        1.0000000
## wine$quality.bin: Low
##                  volatile.acidity citric.acid
## volatile.acidity        1.0000000  -0.5313932
## citric.acid            -0.5313932   1.0000000
## -------------------------------------------------------- 
## wine$quality.bin: High
##                  volatile.acidity citric.acid
## volatile.acidity         1.000000   -0.494798
## citric.acid             -0.494798    1.000000
## wine$quality.bin: Low
##               fixed.acidity citric.acid
## fixed.acidity     1.0000000   0.6522584
## citric.acid       0.6522584   1.0000000
## -------------------------------------------------------- 
## wine$quality.bin: High
##               fixed.acidity citric.acid
## fixed.acidity     1.0000000   0.7452792
## citric.acid       0.7452792   1.0000000

Plotting the wines based on their acidity measures reveals two clusters of high-quality wines:

  • low fixed acidity and citric acid, high volatile acidity
  • low volatile acidity, high fixed acidity and citric acid

Although there is overlap, many low quality wines could already be identified from this plot: wines presented with red dots above the black line and grey dots below it are likely to be of poor quality (the location of the line is approximate and only for illustration). Interestingly, the correlations between acidity measures do not change markedly between the quality classes. The effect is probably an interaction between features.

## wine$quality.bin: Low
##               acidity.inter citric.acid
## acidity.inter    1.00000000 -0.03304717
## citric.acid     -0.03304717  1.00000000
## -------------------------------------------------------- 
## wine$quality.bin: High
##               acidity.inter citric.acid
## acidity.inter     1.0000000  -0.1986388
## citric.acid      -0.1986388   1.0000000

Low quality wines have almost no correlation (-0.03) between citric.acid and the new interaction term, while in high quality wines the correlation is weak to moderate and negative (-0.20).
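
For reference, the interaction term is the product of volatile and fixed acidity, as described in the final summary. The sketch below shows how such a term and its per-class correlation with citric acid could be computed; the six rows are made up for illustration, so the resulting numbers say nothing about the real wines.

```r
# Tiny made-up data frame using the report's column names.
wine_toy <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 11.2, 7.9, 8.7, 10.1),
  volatile.acidity = c(0.70, 0.88, 0.28, 0.30, 0.37, 0.49),
  citric.acid      = c(0.00, 0.04, 0.56, 0.68, 0.40, 0.30),
  quality.bin      = factor(c("Low", "Low", "Low", "High", "High", "High"),
                            levels = c("Low", "High"))
)

# Interaction term: volatile acidity times fixed acidity.
wine_toy$acidity.inter <- wine_toy$volatile.acidity * wine_toy$fixed.acidity

# Correlation between the interaction term and citric acid within each class.
cors <- sapply(split(wine_toy, wine_toy$quality.bin),
               function(d) cor(d$acidity.inter, d$citric.acid))
print(round(cors, 2))
```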

Sulphates, sulfur dioxide and alcohol vs. quality

## wine$quality.bin: Low
##                       sulphates fixed.sulfur.dioxide
## sulphates            1.00000000           0.07797316
## fixed.sulfur.dioxide 0.07797316           1.00000000
## -------------------------------------------------------- 
## wine$quality.bin: High
##                        sulphates fixed.sulfur.dioxide
## sulphates             1.00000000          -0.06161471
## fixed.sulfur.dioxide -0.06161471           1.00000000
## wine$quality.bin: Low
##                         alcohol fixed.sulfur.dioxide
## alcohol               1.0000000           -0.2388809
## fixed.sulfur.dioxide -0.2388809            1.0000000
## -------------------------------------------------------- 
## wine$quality.bin: High
##                        alcohol fixed.sulfur.dioxide
## alcohol              1.0000000            0.1620387
## fixed.sulfur.dioxide 0.1620387            1.0000000
## wine$quality.bin: Low
##            sulphates    alcohol
## sulphates 1.00000000 0.02220524
## alcohol   0.02220524 1.00000000
## -------------------------------------------------------- 
## wine$quality.bin: High
##             sulphates     alcohol
## sulphates  1.00000000 -0.05229298
## alcohol   -0.05229298  1.00000000

Good quality wines form a rather tight cluster. Again it is possible to identify many poor quality wines visually: any grey wine and all wines above the black line are likely to be of poor quality. This time correlations between features are different for different wine qualities. For instance, in low quality wines fixed sulfur dioxide is negatively correlated with alcohol (-0.24), but in high quality wines the relationship is reversed (0.16). Correlations between sulphates and fixed sulfur dioxide, and between sulphates and alcohol, show similar patterns, but to a lesser extent. Classification algorithms could probably do a good job at identifying high-quality wines (quality 7 or 8) using the following features:

  • alcohol
  • fixed acidity
  • volatile acidity
  • citric acid
  • fixed sulfur dioxide
  • sulphates

Classification models

Next, three classification models are trained on the promising features and their performance is assessed. The goal is to achieve a proof of concept for identifying high quality wines based on their chemical properties. Therefore the models are used ‘out of the box’. Model parameters are not optimized and outliers are not removed from the data. The purpose of the models is just to quickly validate the hypothesis that the identified promising features are useful for predicting wine quality.

First the data is randomly split into training and test sets. 70 % of the data is used for training the models and the rest is set aside for evaluating their accuracy. The classification models used are k-nearest neighbors (KNN), support vector machine (SVM) and random forest. They are all well-known algorithms that usually perform well, and they use different modeling approaches.
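
A minimal sketch of such a split on row indices, assuming a simple random sample without stratification. The seed is arbitrary and chosen here only for reproducibility; the original seed and sampling code are not shown in the report.

```r
set.seed(42)  # arbitrary seed, not the one used in the original analysis

n <- 1599                                    # number of wines in the dataset
train_idx <- sample(n, size = floor(0.7 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

cat("training rows:", length(train_idx), " test rows:", length(test_idx), "\n")
```

With 1599 wines this gives 1119 training rows and 480 test rows, which matches the total of 480 predictions in the confusion tables that follow.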

The following parameters are used for the models.

KNN: k = 3. This is a typical value for the number of neighbors to consider in prediction.

SVM: scale = TRUE, kernel = ‘radial’, gamma = 1/6 (1/(data dimension)), cost (C) = 1. These parameters are the defaults of the support vector machine implementation in the e1071 package. Features are normalized, and a radial kernel is used.

Random Forest: ntree = 500, mtry = floor(sqrt(6)) = 2, replace = TRUE, cutoff = 1/2, nodesize = 1. These parameters are the defaults for randomForest in the randomForest package. 500 trees are grown to the maximum size, where the minimum number of observations in a terminal node is 1.
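
To make the KNN step concrete, here is a from-scratch sketch of k-nearest-neighbor classification with k = 3 in base R. It is not the implementation used for the results below (the report does not say which function was called); the function name `knn_predict` and the toy points are my own.

```r
# Classify each row of test_x by majority vote among its k nearest
# training points, using Euclidean distance.
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  apply(test_x, 1, function(row) {
    # Distance from this test point to every training point.
    distances <- sqrt(colSums((t(train_x) - row)^2))
    nearest <- order(distances)[seq_len(k)]
    # Majority vote; ties go to the first level of train_y.
    votes <- table(train_y[nearest])
    names(votes)[which.max(votes)]
  })
}

# Toy data: two well separated groups in two dimensions.
train_x <- matrix(c(0, 0,  0, 1,  1, 0,
                    5, 5,  5, 6,  6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("Low", "Low", "Low", "High", "High", "High"),
                  levels = c("Low", "High"))
test_x  <- matrix(c(0.2, 0.2,
                    5.5, 5.5), ncol = 2, byrow = TRUE)

pred <- as.vector(knn_predict(train_x, train_y, test_x, k = 3))
print(pred)
```

Because the method relies on distances, features measured on very different scales (for example density and total sulfur dioxide) would normally be standardized first.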

## K nearest neighbors predictions:
##             
## prediction_1 Low High
##         Low  391   40
##         High  26   23
## [1] 0.86
## Support vector machine predictions:
##             
## prediction_2 Low High
##         Low  408   50
##         High   9   13
## [1] 0.88
## Random forest predictions:
##             
## prediction_3 Low High
##         Low  405   32
##         High  12   31
## [1] 0.91

Indeed, k nearest neighbors, support vector machine and random forest all work pretty well even without any optimization. In this case random forest has the best performance, achieving 91 % classification accuracy on the test set. Reading the rows of the table as predictions and the columns as actual classes, precision of identifying high quality wines is 0.72 (31 of 43 predicted) and recall is 0.49 (31 of 63 actual). In practical terms, the random forest model is correct about 72 % of the time when it predicts a wine is good, and about 93 % of the time when it predicts a wine is not good, but it finds only about half of the good wines. Because the baseline chance of a wine being of high quality in the test set is only 13 %, the result is a significant improvement.
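
The metrics can be recomputed from the random forest table. The sketch below re-enters the printed counts, treating rows as predictions and columns as actual classes; this reading is an assumption, but it is consistent with the 13 % share of high quality wines in the test set.

```r
# Random forest confusion table, filled column by column.
rf <- matrix(
  c(405, 12,   # actual Low:  predicted Low, predicted High
     32, 31),  # actual High: predicted Low, predicted High
  nrow = 2,
  dimnames = list(prediction = c("Low", "High"), actual = c("Low", "High"))
)

accuracy  <- sum(diag(rf)) / sum(rf)
precision <- rf["High", "High"] / sum(rf["High", ])  # correct among predicted High
recall    <- rf["High", "High"] / sum(rf[, "High"])  # found among actual High
base_rate <- sum(rf[, "High"]) / sum(rf)             # share of actual High
majority_accuracy <- sum(rf[, "Low"]) / sum(rf)      # always predicting Low

print(round(c(accuracy = accuracy, precision = precision, recall = recall,
              base_rate = base_rate, majority = majority_accuracy), 2))
```

Always predicting Low would already score about 0.87 accuracy on this split, which is why precision and recall are more informative than accuracy for such an imbalanced target.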

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The combination of different acidity measures (fixed acidity, volatile acidity and citric acid) turned out to be useful in visually differentiating between high and low quality wines, as did the combination of fixed sulfur dioxide, sulphates and alcohol. The clustering looked much tighter than I had expected based on the bivariate comparisons. After seeing these plots it was not a surprise that classification models performed well at predicting wine quality.

Were there any interesting or surprising interactions between features?

The biggest surprise was the interaction between sulphates and fixed sulfur dioxide. Neither of them was a strong candidate as a predictor, but together they collected high-quality wines in a tight cluster.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

I tried three classification models on the promising features identified during the exploratory analysis. All of them worked well “out of the box”, achieving around 90 % accuracy. In this case random forest had the best performance with 91 % accuracy on the test set. This means that based on six physical measures of the wine, the random forest model can correctly predict nine times out of ten whether the wine is of high quality. The performance of the models could likely be improved further with a little optimization. For instance, the k-value in the k-nearest neighbors model was pulled from a hat, and the other models were fitted with default parameters.


Final Plots and Summary

This report set out to explore the red wine dataset originally collected by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis (Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236). The goal was to find out whether it would be possible to identify high quality wines, as defined by subjective expert evaluations, just based on a few of their measurable chemical properties. The dataset consists of 11 physical measurements of 1599 Portuguese vinho verde red wines.

Plot One

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000
## [1] 0.8075694
## 
##  Low (3-6) High (7-8) 
##       1382        217

Description One

Wine quality is measured on a discrete scale from 0 (very bad) to 10 (very excellent). The distribution of quality scores for the wines is bell-shaped, with median 6 and standard deviation 0.81. Extreme scores are not used and in practice wine quality varies between 3 and 8. As many of the categories have relatively few observations, and the main interest is in differentiating good wines from the rest, it makes sense to combine categories. Only a minority of wines - 217 out of 1599, or 13.6 % - is of high quality.

Plot Two

## wine$quality.bin: Low (3-6)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.00   10.25   10.90   14.90 
## -------------------------------------------------------- 
## wine$quality.bin: High (7-8)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.60   11.52   12.20   14.00
## [1] 0.4761663

Description Two

Amount of alcohol has the strongest association with wine quality. Better wines tend to have more alcohol, 11.5 % on average, in contrast to 10.25 % in low quality wines. The correlation between wine quality (original 10-point scale) and amount of alcohol is 0.48. Wines with more than 11 % alcohol are likely to be high quality wines.

Plot Three

## wine$quality.bin: Low (3-6)
##  fixed.acidity    volatile.acidity  citric.acid    
##  Min.   : 4.600   Min.   :0.160    Min.   :0.0000  
##  1st Qu.: 7.100   1st Qu.:0.420    1st Qu.:0.0825  
##  Median : 7.800   Median :0.540    Median :0.2400  
##  Mean   : 8.237   Mean   :0.547    Mean   :0.2544  
##  3rd Qu.: 9.100   3rd Qu.:0.650    3rd Qu.:0.4000  
##  Max.   :15.900   Max.   :1.580    Max.   :1.0000  
## -------------------------------------------------------- 
## wine$quality.bin: High (7-8)
##  fixed.acidity    volatile.acidity  citric.acid    
##  Min.   : 4.900   Min.   :0.1200   Min.   :0.0000  
##  1st Qu.: 7.400   1st Qu.:0.3000   1st Qu.:0.3000  
##  Median : 8.700   Median :0.3700   Median :0.4000  
##  Mean   : 8.847   Mean   :0.4055   Mean   :0.3765  
##  3rd Qu.:10.100   3rd Qu.:0.4900   3rd Qu.:0.4900  
##  Max.   :15.600   Max.   :0.9150   Max.   :0.7600
## wine$quality.bin: Low (3-6)
##               acidity.inter citric.acid
## acidity.inter    1.00000000 -0.03304717
## citric.acid     -0.03304717  1.00000000
## -------------------------------------------------------- 
## wine$quality.bin: High (7-8)
##               acidity.inter citric.acid
## acidity.inter     1.0000000  -0.1986388
## citric.acid      -0.1986388   1.0000000

Description Three

Three acidity measures, fixed acidity (tartaric acid), volatile acidity (acetic acid), and citric acid, differentiate high quality wines from low quality wines rather well, although the relationship is not straightforward. On average, high quality wines tend to have higher fixed acidity (8.8 vs. 8.2 g/l) and citric acid (0.38 vs. 0.25 g/l), and lower volatile acidity (0.41 vs. 0.55 g/l) than low quality wines.

However, the largest differences come up when interactions between the acidity measures are taken into account. Low quality wines show almost no correlation (-0.03) between citric acid and the interaction term of volatile and fixed acidity (volatile acidity times fixed acidity), which means that low quality wines can have a wide range of citric acid levels regardless of the interaction of the two other measures. With high quality wines this picture changes. The amount of citric acid tends to decrease in high quality wines as the value of the interaction term between volatile and fixed acidity increases. The correlation is weak (-0.20), but clearly stronger than in low quality wines. As a result of these interactions, it is possible to identify clusters of high and low quality wines even visually, as the above plot demonstrates. Dashed lines help illustrate the borders of distinct clusters. In the plot, wines represented by red dots above the dashed line and by grey dots below it are likely to have low quality. The situation is similar when interactions between amounts of sulphates, fixed sulfur dioxide and alcohol are investigated.
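The interaction term and the per-group correlation matrices can be sketched as follows. This is a minimal example on randomly generated data, not the code behind the report: the column names follow the output above, but the values are synthetic and the independent draws are an assumption, so the printed correlations will be close to zero in both groups.

```r
# Synthetic data for illustration only; not the actual measurements.
set.seed(2)
n <- 300
wine <- data.frame(
  fixed.acidity    = runif(n, 4.6, 15.9),
  volatile.acidity = runif(n, 0.12, 1.58),
  citric.acid      = runif(n, 0, 1),
  quality.bin      = factor(sample(c("Low (3-6)", "High (7-8)"), n,
                                   replace = TRUE, prob = c(0.85, 0.15)),
                            levels = c("Low (3-6)", "High (7-8)"))
)

# Interaction term: volatile acidity times fixed acidity.
wine$acidity.inter <- wine$volatile.acidity * wine$fixed.acidity

# Correlation matrix of the interaction term and citric acid per group.
inter.cor <- by(wine[, c("acidity.inter", "citric.acid")],
                wine$quality.bin, cor)
print(inter.cor)
```

Computing the correlation separately for each group is what makes the difference visible; a single correlation over all wines would be dominated by the much larger low quality group.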

Finally, the usability of the promising features (fixed acidity, volatile acidity, citric acid, sulphates, fixed sulfur dioxide and alcohol) for identification of high quality wines was tested by trying to predict the wine quality using k-nearest neighbors, support vector machine and random forest algorithms. The purpose of these models was to act as a proof of concept, so parameter optimization and outlier removal were skipped. Still, the best performing algorithm, random forest, was able to achieve 91 % overall classification accuracy, with 49 % precision and 72 % recall in identifying high quality wines on the test set.
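For reference, accuracy, precision and recall can all be derived from a confusion matrix. The sketch below uses a small set of hypothetical labels, not the report's test set or model output, and treats "high" as the positive class.

```r
# Hypothetical labels for illustration only; not the report's test set.
actual    <- factor(c("high", "high", "high", "low", "low",
                      "low", "low", "low", "low", "low"),
                    levels = c("low", "high"))
predicted <- factor(c("high", "high", "low", "high", "low",
                      "low", "low", "low", "low", "low"),
                    levels = c("low", "high"))

# Confusion matrix: rows are predictions, columns are true classes.
confusion <- table(predicted = predicted, actual = actual)
print(confusion)

true.pos  <- confusion["high", "high"]
false.pos <- confusion["high", "low"]
false.neg <- confusion["low", "high"]

accuracy  <- sum(diag(confusion)) / sum(confusion)  # share of correct predictions
precision <- true.pos / (true.pos + false.pos)      # predicted high that are high
recall    <- true.pos / (true.pos + false.neg)      # actual high that were found
print(c(accuracy = accuracy, precision = precision, recall = recall))
```

Precision and recall matter here because the classes are imbalanced: in the full dataset a model that labels every wine as low quality would already be right for 1382 of 1599 wines, or about 86 %, while finding none of the good ones.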


Reflection

The dataset I explored contained physicochemical measurements of 1599 red wines, along with subjective quality scores. I started by investigating the distributions of individual variables. After that I identified promising correlations between the variables in an effort to find a set of features that could be used to predict wine quality. Instead of the full quality scale, I was only interested in differentiating good wines (score 7 or 8) from the rest. Initially the dataset felt confusing and it didn’t look like there were any interesting patterns, but systematically plotting comparisons of variables slowly revealed many interesting relationships. During the analysis the largest surprise was that some of the variables that did not look like very good predictors on their own worked very well when combined. In the end I used the six identified features to build three classification models. Random forest performed best, achieving 91 % classification accuracy on the test set. The performance of the models could be improved by parameter optimization and possibly by adding new features. Some of the less promising features could still contain useful information.

I made a couple of detours during the analysis by investigating the relationships of density and pH with the other variables they were highly correlated with, and by looking into the interactions of different acidity measures in detail. In the end I did not get anything useful out of this exploration, except perhaps the decision to leave density and pH out of further analysis, because the information they contain seemed likely to be already accounted for by other variables.